encodeURI and unicode

Csaba Gabor

If I do alert(encodeURI(String.fromCharCode(250)));
(in FF 1.5+ or IE6 on my winXP Pro) then I get: %C3%BA

Now I was sort of expecting something like %u... (and a single (4
digit?) unicode hex character num). Is that something for the future,
or am I guaranteed that all % encodings (from encodeURI) will have
exactly two hex digits following?

Perhaps someone could shed some light on this or point me to a quality
site. Be gentle; I know almost nothing about unicode.

Thanks,
Csaba Gabor from Vienna
alert(encodeURI(String.fromCharCode(2500))) => %E0%A7%84
alert(encodeURI(String.fromCharCode(25000))) => %E6%86%A8

Mar 17 '06 #1

Csaba Gabor

Csaba Gabor wrote:

> If I do alert(encodeURI(String.fromCharCode(250)));
> (in FF 1.5+ or IE6 on my winXP Pro) then I get: %C3%BA
>
> Now I was sort of expecting something like %u... (and a single (4
> digit?) unicode hex character num). Is that something for the future,

OK, I think I have most of it now. I was confusing encodeURI with what
I had earlier read at this site:
http://html.megalink.com/programmer/...sTabChars.html

but that is covering how to specify javascript (1.3) strings and not
what happens with encodeURI. I presume this is a reflection of the
spec that browsers must follow in transmitting information to servers.
Still, I was a little surprised.

Here is another interesting point:
var a=String.fromCharCode(131071);
alert(a.charCodeAt(0)+"\n"+a);

That code shows a char code of 65535, and if I use 131072 then the char
code goes to 0. In other words, it wraps.
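
A minimal sketch of the wrap-around described above, assuming a modern JavaScript engine. String.fromCodePoint and codePointAt are ES2015 additions that the browsers mentioned in this thread did not have; they are shown only for contrast.

```javascript
// String.fromCharCode keeps only the low 16 bits of each argument,
// so values above 65535 wrap around modulo 65536.
const wrappedHigh = String.fromCharCode(131071).charCodeAt(0);
const wrappedZero = String.fromCharCode(131072).charCodeAt(0);
console.log(wrappedHigh); // 65535 (131071 mod 65536)
console.log(wrappedZero); // 0     (131072 mod 65536)

// String.fromCodePoint (ES2015) does not wrap: it builds a
// surrogate pair for code points above 0xFFFF.
const astral = String.fromCodePoint(131071);
console.log(astral.length);         // 2 (two UTF-16 code units)
console.log(astral.codePointAt(0)); // 131071
```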

I just have one question at this point. As I mentioned in my original
post,
String.fromCharCode(2500) == "\u09C4" => %E0%A7%84
The first equivalence is easy since 9C4 is the hex representation of
(decimal) 2500. But how do we get to the encodeURI output on the
right?

Csaba

Mar 17 '06 #2

Thomas 'PointedEars' Lahn

Csaba Gabor wrote:

> I just have one question at this point. As I mentioned in my original
> post, String.fromCharCode(2500) == "\u09C4" => %E0%A7%84
> The first equivalence is easy since 9C4 is the hex representation of
> (decimal) 2500. But how do we get to the encodeURI output on the
> right?

Those are percent-escaped representations of the three UTF-8 code
units that are required to encode the Unicode character at code
point U+09C4. See also ECMAScript 3 Final, subsection 15.1.3, and
<URL:http://people.w3.org/rishida/scripts/uniview/conversion>.
PointedEars
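
A small sketch of the point above, assuming a browser or Node.js console. It checks that every escape produced by encodeURI for the sample values in this thread is "%" followed by exactly two hex digits (one UTF-8 byte each). The %uXXXX form comes from the legacy global escape(), which is deprecated and shown here only for comparison. The variable names are illustrative.

```javascript
// Char codes taken from the examples earlier in the thread.
const samples = [250, 2500, 25000];

// encodeURI percent-escapes the UTF-8 bytes of each character.
const encoded = samples.map(function (code) {
  return encodeURI(String.fromCharCode(code));
});
console.log(encoded.join("\n"));
// %C3%BA
// %E0%A7%84
// %E6%86%A8

// Every escape is "%" plus exactly two hex digits.
const twoDigitEscapes = /^(%[0-9A-F]{2})+$/;
const allTwoDigits = encoded.every(function (s) {
  return twoDigitEscapes.test(s);
});
console.log(allTwoDigits); // true

// The legacy escape() function is the one that emits %uXXXX.
const legacy = escape(String.fromCharCode(2500));
console.log(legacy); // %u09C4
```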

Mar 18 '06 #3

Csaba Gabor

Thomas 'PointedEars' Lahn wrote:

> Csaba Gabor wrote:
>> I just have one question at this point. As I mentioned in my original
>> post, String.fromCharCode(2500) == "\u09C4" => %E0%A7%84
>
> Those are percent-escaped representations of the three UTF-8 code ...
> <URL:http://people.w3.org/rishida/scripts/uniview/conversion>.

Thanks Thomas, I like links.
It let me figure out the unicode / UTF8 mapping.
He's got a function, convertCP2UTF8 (spaceSeparatedHexValues) that does
essentially:

n = ...unicodeValue...
if (n <= 0x7F) return dec2hex2(n);
else if (n <= 0x7FF) return
dec2hex2(0xC0 | ((n>>6) & 0x1F)) + ' ' +
dec2hex2(0x80 | (n & 0x3F));
else if (n <= 0xFFFF) return
dec2hex2(0xE0 | ((n>>12) & 0x0F)) + ' ' +
dec2hex2(0x80 | ((n>>6) & 0x3F)) + ' ' +
dec2hex2(0x80 | (n & 0x3F));
else if (n <= 0x10FFFF) return
dec2hex2(0xF0 | ((n>>18) & 0x07)) + ' ' +
dec2hex2(0x80 | ((n>>12) & 0x3F)) + ' ' +
dec2hex2(0x80 | ((n>>6) & 0x3F)) + ' ' +
dec2hex2(0x80 | (n & 0x3F));
else return '!erreur ' + dec2hex(n);

In words: If your positive integer (the char code) is not less than
17*16^4, report an error,
and if it is 7 bits or less (in the range [0, 2^7), that is), just
return the two digit hex representation.

Otherwise, let k be the number of bits in your number. That is to say,
k is the smallest integer such that 2^k is greater than your number
(e.g. [2^(k-1),2^k)->k; [128,256)->8; [8,16)->4; [4,8)->3; [2,4)->2;
1->1; 0->0). Now, starting at the low end, section the number into
m=ceiling((k-1)/5) groups of 6 bits, with any leftovers in the final
(high) group. Prefix all but the high groups with (bits) 10 (that is
to say, OR them with (hex) 80). Prefix the high group with the m+1
bits corresponding to 2^(m+1)-2. That is to say, prefix the first
group of 2 with (bits) 110, the first group of 3 with 1110, or the
first group of 4 with 11110.

Thus, if your number has 7 bits or less, it takes two hex digits to
represent. From 8 to 11 (inclusive) it takes four hex digits, from 12
to 16 (inclusive) it takes six, and from 17 to 21 (inclusive) bits it
takes eight hex digits to represent.

Example: 2500 -> 0x9C4 ->
1001 1100 0100 so k=12 and m=3 ->
(0000) 100111 000100 (that first group got no bits so it is implied) ->
(1110)0000 (10)100111 (10)000100 ->
E0 A7 84
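
The worked example can be checked with a short runnable sketch. This is not the original convertCP2UTF8 from the linked page; hex2 and codePointToUtf8Hex are hypothetical names chosen for illustration, and the sketch does not reject surrogate values (0xD800 to 0xDFFF), which a strict encoder would.

```javascript
// Format one byte as two uppercase hex digits.
function hex2(byte) {
  return (byte < 16 ? "0" : "") + byte.toString(16).toUpperCase();
}

// Convert a Unicode code point to its UTF-8 bytes, as spaced hex.
function codePointToUtf8Hex(n) {
  if (n < 0 || n > 0x10FFFF) {
    throw new RangeError("code point out of range: " + n);
  }
  let bytes;
  if (n <= 0x7F) {
    bytes = [n];                                  // 0xxxxxxx
  } else if (n <= 0x7FF) {
    bytes = [0xC0 | (n >> 6),                     // 110xxxxx
             0x80 | (n & 0x3F)];                  // 10xxxxxx
  } else if (n <= 0xFFFF) {
    bytes = [0xE0 | (n >> 12),                    // 1110xxxx
             0x80 | ((n >> 6) & 0x3F),
             0x80 | (n & 0x3F)];
  } else {
    bytes = [0xF0 | (n >> 18),                    // 11110xxx
             0x80 | ((n >> 12) & 0x3F),
             0x80 | ((n >> 6) & 0x3F),
             0x80 | (n & 0x3F)];
  }
  return bytes.map(hex2).join(" ");
}

console.log(codePointToUtf8Hex(2500)); // E0 A7 84

// Cross-check against the built-in encoder.
const viaSketch = "%" + codePointToUtf8Hex(2500).split(" ").join("%");
console.log(viaSketch === encodeURI("\u09C4")); // true
```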

With this it's also easy to see how to work from UTF-8 to unicode.
Given a byte, scan from the high (left) side for the first 0 bit.
If the high bit is 0, you are done and you have a "normal" character.
Otherwise, the character is specified by the next m bytes (including
the one the scan started with), where m is the number of 1s
encountered before finding that first 0 bit. Knock out all the bits
up to and including the first 0 bit, and the top 2 bits of all the
rest, and concatenate the remaining bits to get the char code.
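
The reverse direction might be sketched as follows. utf8BytesToCodePoint is a hypothetical helper, not part of any library; it decodes a single character from an array of byte values and, as a simplification, does not reject overlong encodings or surrogates. In this sketch the leading 1 bits of the first byte determine how many bytes belong to the character.

```javascript
// Decode one character from an array of UTF-8 byte values.
function utf8BytesToCodePoint(bytes) {
  const first = bytes[0];

  // Count the 1 bits before the first 0 bit, from the high end.
  let ones = 0;
  while (ones < 8 && (first & (0x80 >> ones))) {
    ones++;
  }
  if (ones === 0) {
    return first; // plain 7-bit character
  }
  if (ones === 1 || ones > 4 || bytes.length < ones) {
    throw new RangeError("invalid lead byte or truncated sequence");
  }

  // Drop the prefix bits (the 1s and the 0 that ends them).
  let codePoint = first & (0x7F >> ones);

  // Each following byte contributes its low 6 bits.
  for (let i = 1; i < ones; i++) {
    if ((bytes[i] & 0xC0) !== 0x80) {
      throw new RangeError("expected a continuation byte (10xxxxxx)");
    }
    codePoint = (codePoint << 6) | (bytes[i] & 0x3F);
  }
  return codePoint;
}

console.log(utf8BytesToCodePoint([0xE0, 0xA7, 0x84])); // 2500

// Cross-check against the built-in decoder.
console.log(decodeURI("%E0%A7%84").charCodeAt(0)); // 2500
```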

Thus, we see the correspondence between UTF8 and unicode.

Csaba

I found the following sites useful for seeing mappings and glyphs:
http://www.unicode.org/charts/About.html and
http://www.macchiato.com/unicode/chart/

Mar 18 '06 #4

Thomas 'PointedEars' Lahn

Csaba Gabor wrote:

> Thomas 'PointedEars' Lahn wrote:
>> Csaba Gabor wrote:
>>> I just have one question at this point. As I mentioned in my original
>>> post, String.fromCharCode(2500) == "\u09C4" => %E0%A7%84
>> Those are percent-escaped representations of the three UTF-8 code ...

Would you please at least try to retain context in quotations?

<URL:http://learn.to/quote>

>> <URL:http://people.w3.org/rishida/scripts/uniview/conversion>.
>
> Thanks Thomas, I like links.
> It let me figure out the unicode / UTF8 mapping.
> He's got a function, convertCP2UTF8 (spaceSeparatedHexValues) that does
> essentially:
>
> n = ...unicodeValue...
> if (n <= 0x7F) return dec2hex2(n);
> else if (n <= 0x7FF) return
> dec2hex2(0xC0 | ((n>>6) & 0x1F)) + ' ' +
> dec2hex2(0x80 | (n & 0x3F));
> else if (n <= 0xFFFF) return
> dec2hex2(0xE0 | ((n>>12) & 0x0F)) + ' ' +
> dec2hex2(0x80 | ((n>>6) & 0x3F)) + ' ' +
> dec2hex2(0x80 | (n & 0x3F));
> else if (n <= 0x10FFFF) return
> dec2hex2(0xF0 | ((n>>18) & 0x07)) + ' ' +
> dec2hex2(0x80 | ((n>>12) & 0x3F)) + ' ' +
> dec2hex2(0x80 | ((n>>6) & 0x3F)) + ' ' +
> dec2hex2(0x80 | (n & 0x3F));
> else return '!erreur ' + dec2hex(n);
>
> In words: If your positive integer (the char code) is not less
> than 17*16^4, report an error,

Yes. The error is reported if the value is greater than or equal to
0x110000, because The Unicode Standard, version 4.0, does not provide
for more than 1114112 code points, starting with code point U+0000.

(BTW: You have mis-wrapped your abstraction of the original source
code; a trailing `return' statement would only return `undefined',
not the evaluated value of the following lines.)

> and if it is 7 bits or less (in the range [0, 2^7), that is), just
> return the two digit hex representation.

Yes. One (8-bit) UTF-8 code unit suffices to encode Unicode characters
at these code points.

> [...]
> With this it's also easy to see how to work from UTF-8 to unicode.
> [...]
> Thus, we see the correspondence between UTF8 and unicode

You are not making any sense. `n' is assigned the code point (CP)
number, which is then converted into UTF-8 code units, according to
the algorithms specified in The Unicode Standard, version 4.0.

<URL:http://en.wikipedia.org/wiki/Unicode>

> [...]
> I found the following sites useful for seeing mappings and glyphs:
> http://www.unicode.org/charts/About.html and
> http://www.macchiato.com/unicode/chart/

But obviously you have not found <URL:http://unicode.org/faq/> yet.
Please make it so.

PointedEars

Mar 18 '06 #5

Csaba Gabor

Thomas 'PointedEars' Lahn wrote:

> Would you please at least try to retain context in quotations?

I did.

> You are not making any sense. `n' is assigned the code point (CP)
> number, which is then converted into UTF-8 code units, according to
> the algorithms specified in The Unicode Standard, version 4.0.

Sorry you didn't get it. It seems I was spot on in showing how to go
from CP number to the UTF-8 code units and back, as can be verified at
the nice
http://en.wikipedia.org/wiki/UTF-8

Csaba

Mar 19 '06 #6

Thomas 'PointedEars' Lahn

Csaba Gabor wrote:

> Thomas 'PointedEars' Lahn wrote:
>> Would you please at least try to retain context in quotations?
>
> I did.

You did not. I wrote (at least):

| Those are percent-escaped representations of the three UTF-8 code
| units that are required to encode the Unicode character at code
| point U+09C4. [...]
| <URL:http://people.w3.org/rishida/scripts/uniview/conversion>.

You quoted me:

| > Those are percent-escaped representations of the three UTF-8 code ...
| > <URL:http://people.w3.org/rishida/scripts/uniview/conversion>.

You call that /retaining/ context? You even removed the "units" word.

>> You are not making any sense. `n' is assigned the code point (CP)
>> number, which is then converted into UTF-8 code units, according to
>> the algorithms specified in The Unicode Standard, version 4.0.

Thank you for destroying the context again.

> Sorry you didn't get it.

YMMD.

> It seems I was spot on in showing how to go from CP number to
> the UTF-8 code units and back, as can be verified at the nice
> http://en.wikipedia.org/wiki/UTF-8

What you think it seemed, and what you actually meant, is not relevant
regarding the question whether you have been making sense or not. You
said this shows the relation between Unicode and UTF-8, which is nonsense,
because the relation has always been there. UTF-8 is one possible encoding
to encode Unicode characters.

Better express yourself next time, this way you can avoid misunderstandings.

Score adjusted

PointedEars

Mar 19 '06 #7

Csaba Gabor

Thomas 'PointedEars' Lahn wrote:

> Csaba Gabor wrote:
>> Thomas 'PointedEars' Lahn wrote:
>>> Would you please at least try to retain context in quotations?
>>
>> I did.
>
> You did not. I wrote (at least):

In fact, I did try. You are not an authority on me so I will
appreciate it if you will refrain from making assertions on
things you cannot know.

> | Those are percent-escaped representations of the three UTF-8 code
> | units that are required to encode the Unicode character at code
> | point U+09C4. [...]
> | <URL:http://people.w3.org/rishida/scripts/uniview/conversion>.
>
> You quoted me:
>
> | > Those are percent-escaped representations of the three UTF-8 code ...
> | > <URL:http://people.w3.org/rishida/scripts/uniview/conversion>.
>
> You call that /retaining/ context? You even removed the ....

Yes.
Upon review, I find that I quoted exactly what I wanted to quote.

>>> You are not making any sense. `n' is assigned the code point (CP)
>>> number, which is then converted into UTF-8 code units, according to
>>> the algorithms specified in The Unicode Standard, version 4.0.
>
> Thank you for destroying the context again.
>
>> Sorry you didn't get it.
>
> YMMD.
>
>> It seems I was spot on in showing how to go from CP number to
>> the UTF-8 code units and back, as can be verified at the nice
>> http://en.wikipedia.org/wiki/UTF-8
>
> What you think it seemed, and what you actually meant, is not relevant
> regarding the question whether you have been making sense or not.

In fact it is, since making sense is always subjective.

> You said this shows the relation between Unicode and UTF-8, which is nonsense,

Really? Care to offer a quote for your assertion about what I said?
I never even used the word relationship in this thread.

> because the relation has always been there. UTF-8 is one possible encoding
> to encode Unicode characters.
>
> Better express yourself next time, this way you can avoid misunderstandings.

Now that I have expressed myself, you might consider
expressing yourself better next time.
In particular, ordering and making demands on people is neither polite,
nor very effective on newsgroups where there is no means of enforcement.
If there is something that you would like to see done differently, then
it might be more expedient to point out what bothers you about it, and
suggest what would make you happier. Just saying "Don't" or "That was
nonsense" is not very constructive in forestalling future occurrences.

Csaba Gabor from Vienna

Apr 20 '06 #8
